Photo by Bryony Elena on Unsplash
Travel and tourism is an important sector for the UK that employs over 300 million people and, in 2022, contributed roughly £237 billion to the country’s gross domestic product, representing approximately 10% of the UK’s economy. An essential part of this sector is its visitor attractions, ranging from castles and stately homes, zoos and aquariums, to botanical gardens and national parks. In this lab we will explore data published by the Association of Leading Visitor Attractions on visitor numbers in 2022 to a range of tourist attractions across the UK.
Before we get started, please ensure that you have RStudio installed, have a GitHub account, and are able to push and pull correctly. If not, please follow the set-up instructions here. Ask a tutor for help if you have any problems.
Find your workshop group. If you do not know what group you are in, go to your course timetable and you will see that the title of this class is “Introduction to Data Science - Workshop/<nn>” where “<nn>” is your group number.
Please help each other in your group in answering the workshop exercises and de-bugging each other’s code. If you are stuck or unsure as to what you need to do, raise your hand and a tutor will come over to help you.
Log onto GitHub and create a new repository by cloning today’s lab template project. To remind you of the steps:
lab-01, and click on Begin import.
Open RStudio and create a new version control project using the GitHub repository you have just made. To remind you of the steps:
Open the R Markdown document lab-01.Rmd and change the
author to your name. Knit the document and make sure that it
compiles correctly without any errors.
✅ ⬆️ Commit and push your changes to GitHub with an appropriate commit message. As a reminder of the steps:
Git tab (top right panel) and click on Commit.
❗ Make sure you are familiar with the above steps as you will need to do this for your homework and future labs. Furthermore, remember to follow good version control practices by keeping your repository on GitHub up to date: periodically stage and commit any substantial changes, and then push them to GitHub.
In today’s lab we will be using the tidyverse package and its dependencies. At the top of your template you will find the following code that will load this library whenever you knit your document.
If you do not have this library installed, then your document may fail to knit. To install any R packages:
1. Go to the Packages tab (bottom right panel).
2. Click on the Install button.
3. Type in the name of the package (tidyverse), and ensure that the option Install dependencies is ticked.
4. Click on the Install button.

Alternatively, you can install the tidyverse package by
running the command install.packages("tidyverse") once on
the console (bottom left panel). Do not write this
command in your Rmd file, otherwise RStudio will attempt to
re-install the package each time you knit your document!
RStudio will then download and install the tidyverse
package and any other dependent packages. Read the messages that are
printed to the console and check that there are no installation
errors.
The dataset for this assignment can be found as a comma separated
value (.csv) file in the data folder of your
repository. You can read it into your work using the following code. (We
will discuss more about importing and exporting data next week.)
Use the head() command as demonstrated below to view the
first 10 rows of the data set.
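As a rough sketch of how head() behaves, the example below uses a small made-up data set (`toy_visitors`, with invented values, not the lab data). Note that head() returns 6 rows by default, so the `n` argument is needed to ask for 10.

```r
library(dplyr)  # part of the tidyverse; re-exports tibble()

# Made-up stand-in for the visitors data: 12 rows of invented values
toy_visitors <- tibble(
  attraction = paste("Attraction", LETTERS[1:12]),
  n_2022     = seq(1000, 12000, by = 1000)
)

first_six <- head(toy_visitors)          # default: first 6 rows
first_ten <- head(toy_visitors, n = 10)  # explicitly ask for 10 rows

first_ten
```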
The variables in the visitors data set are:
| Variable | Description |
|---|---|
| attraction | Name of the tourist attraction. |
| n_2021 | Number of visitors in 2021. |
| n_2022 | Number of visitors in 2022. |
| admission | Whether admission is “Free” for all, free for “Members” or everyone is “Charged”. |
| setting | Whether the attraction is “O”utside, “I”nside or has a “M”ix of inside and outside settings. |
| region | The region of the UK in which the attraction is located. |
Let’s first inspect the data. Load the data into your environment by
clicking on the green play button at the top-right corner of the R code
chunks, and then on the console (bottom left panel) type
View(visitors). This will open a new tab in the top left
panel showing a spreadsheet of the data.
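In your teams, have a look at the spreadsheet to answer the following questions:
1. How many tourist attractions are there in the data set?
2. What are the variable data types?
3. Which attraction had the most visitors in 2022?
4. What is the admission charge for the "National Museum of Scotland"?
5. How many "O"utside attractions are there in the "Yorkshire and the Humber" region that give "Members" free admission, which had more than 100,000 visitors in 2022?

Once you have answered the above questions, check your answers. No cheating!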
Show answers
n_2021 and n_2022 are both numerical variables (specifically,
<dbl> indicates that they are double-precision floating point
numbers), whilst the remaining variables are text or character variables
(indicated by <chr>).

Answering the above questions can be done by visually searching through a spreadsheet of the data, but this can become very tedious when the question becomes only slightly more complicated. How long did it take you to answer the last question compared to the first? Wouldn’t it be better if we could use a computer to help us?
Welcome to data wrangling!
Let’s take a look at each of the above questions in turn and see how we might answer them using R code.
How many tourist attractions are there in the data set?
Each row in the data set corresponds to a different visitor
attraction, so to answer this question we need to count how
many rows there are. The simplest solution is to run the command
nrow(visitors), which calculates the number of rows
in the visitors data set (similarly,
ncol(visitors) calculates the number of columns, i.e., the
number of variables).
Alternatively, we can use the count() command using the
%>% pipework as follows:
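A minimal sketch of both approaches, run on a made-up four-row data set (`toy_visitors` and its values are invented for illustration and are not the lab data):

```r
library(dplyr)  # part of the tidyverse; provides %>% and count()

# Made-up stand-in for the visitors data
toy_visitors <- tibble(
  attraction = c("Attraction A", "Attraction B", "Attraction C", "Attraction D"),
  admission  = c("Free", "Charged", "Members", "Free"),
  n_2022     = c(120000, 45000, 300000, 80000)
)

nrow(toy_visitors)  # number of rows (attractions)
ncol(toy_visitors)  # number of columns (variables)

# The same row count using the pipe; the result is a one-row tibble
# with a single column called n
n_attractions <- toy_visitors %>%
  count()

n_attractions
```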
The count() command can also be used for creating
frequency tables by putting the name of a specific
variable into the command. For example, the following creates a
frequency table for the different admission types:
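The frequency table described above might look like the following sketch, again on invented data (`toy_visitors` is a hypothetical stand-in for the lab data set):

```r
library(dplyr)

# Made-up stand-in for the visitors data
toy_visitors <- tibble(
  attraction = c("Attraction A", "Attraction B", "Attraction C", "Attraction D"),
  admission  = c("Free", "Charged", "Members", "Free")
)

# One row per admission type, with the number of attractions in column n
admission_table <- toy_visitors %>%
  count(admission)

admission_table
```

Adding `sort = TRUE` inside count() orders the table from the most to the least frequent category.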
region.
admission and setting.
What are the variable data types?
On viewing a printout of the data set, either on the console or in the knitted document, the line immediately under the variable names indicates the data type of each variable. A detailed list of variable types can be found here.
To find this information using R code, we can use the
class() command. For example, showing that the data type of
the n_2022 variable is numerical can be achieved using the
following code.
Applying the class command to each individual variable can be very
tedious, especially if the data set contains many variables.
Alternatively, we can use the summarise_all() command that
applies a specified function to all of the columns of the data set.
Therefore, we can extract the data type for all of the variables as
follows:
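The two approaches can be sketched as below, using invented data (`toy_visitors` is hypothetical). In recent versions of dplyr, summarise_all() is marked as superseded in favour of `summarise(across(everything(), class))`, although both forms should give the same result here.

```r
library(dplyr)

# Made-up stand-in for the visitors data
toy_visitors <- tibble(
  attraction = c("Attraction A", "Attraction B", "Attraction C"),
  n_2022     = c(120000, 45000, 300000),
  region     = c("Scotland", "Wales", "London")
)

# Data type of a single variable
n_2022_class <- class(toy_visitors$n_2022)
n_2022_class

# Data type of every variable at once: one row, one column per variable
all_classes <- toy_visitors %>%
  summarise_all(class)

all_classes
```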
Understanding the data type of each variable is important to
understand whether the R code we write will work as expected and to
determine which commands will work for the data types in the data set.
For example, the code "1" + "2" will create an error as the
addition operator expects two numerical values, but the data type of
"1" is character.
Which attraction had the most visitors in 2022?
This question can be addressed by ordering the data based on the
n_2022 variable. This can be performed using the
arrange() command. Note however that by default the
arrange() command orders the rows into increasing
numerical order, so that the first row corresponds to the attraction
with the fewest visitors. Wrapping the variable in desc() sorts
the data in descending order, so that the first row
corresponds to the most visited attraction. For example:
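A short sketch of both orderings on made-up data (the attraction names and visitor numbers in `toy_visitors` are invented):

```r
library(dplyr)

# Made-up stand-in for the visitors data
toy_visitors <- tibble(
  attraction = c("Attraction A", "Attraction B", "Attraction C"),
  n_2022     = c(120000, 45000, 300000)
)

# Default: increasing order, so the least visited attraction comes first
least_first <- toy_visitors %>%
  arrange(n_2022)

# desc(): decreasing order, so the most visited attraction comes first
most_first <- toy_visitors %>%
  arrange(desc(n_2022))

most_first
```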
What is the admission charge for the
"National Museum of Scotland"?
This question asks you to search through, or filter, the
data to find the row that satisfies a particular condition. The
filter() command can be used to extract all of the rows in
the data set that satisfy a specific logical test, retaining all rows
where the expression is TRUE and removing those where it is
FALSE.
There are 6 key operations when performing a logical test:
| Operation | Code | Example (TRUE) | Example (FALSE) |
|---|---|---|---|
| Equality (\(=\)) | == | 3 == 3 | 3.14 == pi |
| Not equal (\(\neq\)) | != | -1 != 1 | sin(0) != 0 |
| Less than (\(<\)) | < | 2 < 3 | 4 < 4 |
| Less than or equal (\(\leq\)) | <= | -1 <= 2 | log(5) <= 1 |
| Greater than (\(>\)) | > | sqrt(2) > 1.41 | cos(pi) > sin(3*pi/2) |
| Greater than or equal (\(\geq\)) | >= | Inf >= 10^6 | -5 >= -2 |
The logical test that is needed to filter the data set is to assess
whether the content of the attraction variable exactly
matches the text "National Museum of Scotland".
Specifically:
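The sketch below shows the same kind of exact-match filter on invented data. The name "Example Museum" and everything else in `toy_visitors` are made up; for the lab data you would test against "National Museum of Scotland" instead.

```r
library(dplyr)

# Made-up stand-in for the visitors data
toy_visitors <- tibble(
  attraction = c("Example Castle", "Example Museum", "Example Zoo"),
  admission  = c("Charged", "Free", "Members")
)

# Keep only the rows where the attraction name matches exactly
museum_row <- toy_visitors %>%
  filter(attraction == "Example Museum")

museum_row
```

The match is case-sensitive and must agree with the stored text character for character, so "example museum" would return no rows.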
How many "O"utside attractions are there in the
"Yorkshire and the Humber" region that give
"Members" free admission, which had more than 100,000
visitors in 2022?
The previous question used the filter() command to
extract the rows based on a single logical test, but here we need to
apply 4 logical tests. The filter() command can still be
applied by writing the R code for each test separated by a comma.
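As a small illustration on invented data (`toy_visitors` is hypothetical), separating tests with commas keeps only the rows that pass every test, which is the same as joining the tests with the `&` operator:

```r
library(dplyr)

# Made-up stand-in for the visitors data
toy_visitors <- tibble(
  attraction = c("Attraction A", "Attraction B", "Attraction C", "Attraction D"),
  setting    = c("O", "O", "I", "O"),
  admission  = c("Members", "Free", "Members", "Members"),
  n_2022     = c(150000, 200000, 500000, 80000)
)

# Tests separated by commas: a row is kept only if all of them are TRUE
with_commas <- toy_visitors %>%
  filter(
    setting == "O",
    admission == "Members",
    n_2022 > 100000
  )

# Equivalent form using the logical AND operator
with_and <- toy_visitors %>%
  filter(setting == "O" & admission == "Members" & n_2022 > 100000)

with_commas
```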
We can stop there and view the result of the filtering to see which
attractions meet the 4 conditions, but the question asks for ‘How many’,
not ‘Which’. Therefore, we can pipe (%>%) the result
from the filter() command directly into the
count() command:
visitors %>%
  filter(
    setting == "O",
    region == "Yorkshire and the Humber",
    admission == "Members",
    n_2022 > 100000   # strictly more than 100,000 visitors
  ) %>%
  count()

Coding Style: Writing all of the code for this example on a single line will work, but doing so can make your code very difficult to read. This is particularly important when there is an error and you are trying to de-bug your code. Make use of ‘whitespace’ (spaces, new lines and tabbing) to aid the readability of your code.
✅ ⬆️ Now is a good time to save and commit any changes to your work, and to push them to your repository on GitHub.
In this section we will investigate how to answer the following questions:
Before continuing, have a think about how you would attempt to answer these two questions.
What are the mean and median visitor numbers in 2022 across all attractions?
R has many commands that can be used to compute a wide range of
standard statistical values. These include the mean()
command for computing the arithmetic mean or average value, and the
median() command for computing the median or mid-point of
the data.
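A minimal sketch of computing both statistics inside summarise(), using three invented visitor counts (`toy_visitors` is hypothetical and deliberately contains no missing values):

```r
library(dplyr)

# Made-up stand-in for the visitors data
toy_visitors <- tibble(
  attraction = c("Attraction A", "Attraction B", "Attraction C"),
  n_2022     = c(20000, 40000, 90000)
)

# Collapse the whole data set into a single row of summary statistics
toy_summary <- toy_visitors %>%
  summarise(
    mean_2022   = mean(n_2022),
    median_2022 = median(n_2022)
  )

toy_summary
```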
What does NA stand for and why are you getting this as your answer to the previous question?
Look at the help pages for the mean() and median() commands to see what the input argument na.rm does. Edit your code from exercise h so that it computes the summary statistics where data is available.
Which setting (inside, outside or mixed) has the largest mean visitor numbers in 2022?
This question is a simple extension of the previous question where we
are now interested in computing the mean statistic per
setting rather than across all settings. To implement this
change, we need to group the data before calculating the
summary statistics. This is achieved by using the group_by
command as follows:
From the output you can easily read which setting, whether
"I"nside, "O"utside or "M"ixed,
has the largest mean visitor numbers in 2022.
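The grouping step described above can be sketched as follows. The data in `toy_visitors` are invented, so the numbers say nothing about the real attractions.

```r
library(dplyr)

# Made-up stand-in for the visitors data
toy_visitors <- tibble(
  attraction = c("Attraction A", "Attraction B", "Attraction C",
                 "Attraction D", "Attraction E"),
  setting    = c("I", "I", "O", "O", "M"),
  n_2022     = c(100000, 300000, 50000, 150000, 250000)
)

# Group the rows by setting, then compute one mean per group
mean_by_setting <- toy_visitors %>%
  group_by(setting) %>%
  summarise(mean_2022 = mean(n_2022))

mean_by_setting
```

The result has one row per setting, so the largest mean can be read off directly or brought to the top with `arrange(desc(mean_2022))`.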
If you have time, discuss in your group why the mean values across all settings are much larger than the median values.
What is the interquartile range (the width of the middle 50% of the data set, between the lower and upper quartiles) for each of the four nations of the UK?
This is a more complex question that requires some pre-processing of the data before doing any calculations.
First, look at the region variable. There are entries
for "Northern Ireland", "Scotland" and
"Wales", but England is split into 9 different regions. The
first stage is to transform the data set by creating a new variable
called nation that groups all 9 English regions together as "England".
A transformation of the data can be performed using the
mutate() command. For this question we need to consider how
best to transform the region variable to the new
nation variable. The transformation can be described by the
following if-else function:
\[{nation} = \left\{ \begin{array}{ll}\text{Northern Ireland} & \text{if}~{region} = \text{Northern Ireland}, \text{else},\\\text{Scotland} & \text{if}~{region} = \text{Scotland},\text{else},\\\text{Wales} & \text{if}~{region} = \text{Wales},\text{else},\\\text{England} & \text{otherwise}.\\ \end{array}\right.\]
This function can be implemented using the case_when()
function. The general structure of this command is as follows:
case_when(
  logical_test_1 ~ returned_value_1,
  logical_test_2 ~ returned_value_2,
  ...
  logical_test_n ~ returned_value_n,
  TRUE ~ returned_value_default
)

On entering this command, the first logical test is performed and if
the answer is TRUE then the command returns the value after
the ~ symbol. If the result of the first test is
FALSE then the next logical test is evaluated. The
TRUE statement at the end acts as the ‘otherwise’ scenario
to capture all remaining cases that did not satisfy any of the earlier
tests.
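To see the "first match wins" behaviour in isolation, here is a small sketch on a plain numeric vector (the vector `x` and the labels are invented for illustration). The value -3 satisfies both of the first two tests, but only the first matching test determines the result.

```r
library(dplyr)  # provides case_when()

x <- c(-3, 0, 7, 12)

size <- case_when(
  x < 0  ~ "negative",  # tested first
  x < 10 ~ "small",     # only reached when x < 0 is FALSE
  TRUE   ~ "large"      # the 'otherwise' case
)

size
```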
The case_when command can be used to implement the
transformation of the region variable to construct the new
nation variable as follows:
visitors_with_nations <- visitors %>%
  mutate(
    nation = case_when(
      region == "Northern Ireland" ~ "Northern Ireland",
      region == "Scotland" ~ "Scotland",
      region == "Wales" ~ "Wales",
      TRUE ~ "England"
    )
  )

Here we have assigned the result of the transformation to a new
variable in memory called visitors_with_nations so that we
can use the result without the need to continuously re-evaluate the
nation variable as part of a workflow pipeline.
Finally, we can now summarise the data to compute the interquartile
range for each nation using the IQR() command:
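A sketch of this final step on invented data is given below. `toy_with_nations` is a hypothetical stand-in for visitors_with_nations, with made-up visitor numbers and no missing values. If a column does contain missing values, IQR() needs `na.rm = TRUE`, in the same way as mean() and median().

```r
library(dplyr)

# Made-up stand-in for visitors_with_nations
toy_with_nations <- tibble(
  nation = c(rep("England", 5), rep("Scotland", 3)),
  n_2022 = c(10000, 20000, 30000, 40000, 50000,
             100000, 200000, 300000)
)

# One interquartile range per nation
iqr_by_nation <- toy_with_nations %>%
  group_by(nation) %>%
  summarise(iqr_2022 = IQR(n_2022))

iqr_by_nation
```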
✅ ⬆️ Now is a good time to save and commit any changes to your work, and to push them to your repository on GitHub.
If you have time at the end of today’s lab, have a go at answering the following more challenging exercises
That’s the end of this lab!
✅ ⬆️ Remember to commit and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files (that is, make sure every file has a tick next to it) so that your Git pane is cleared up afterwards.
Don’t worry if you did not reach the end of the worksheet today, but please try to go through any remaining exercises in your own time.